
Conversation

@whybe-choi (Contributor) commented Oct 7, 2025

Close #3268

This pull request adds new BRIGHT subset benchmarks and their corresponding descriptive statistics to the retrieval benchmark suite. These changes enable more granular, domain-specific evaluation for reasoning-intensive retrieval tasks, both for standard and long document formats.

Benchmark additions

  • Introduced two new benchmarks, BRIGHT_SUBSETS and BRIGHT_SUBSETS_LONG, to the mteb/benchmarks/benchmarks/benchmarks.py file, covering individual domains of the BRIGHT benchmark for both standard and long document retrieval tasks.
  • Registered the new benchmarks in the mteb/benchmarks/benchmarks/__init__.py file for import and usage.

Descriptive statistics

  • Added descriptive statistics JSON files for each new BRIGHT subset retrieval task, including both standard and long formats (e.g., BrightBiologyRetrieval.json, BrightBiologyLongRetrieval.json, etc.), detailing sample counts, text lengths, and relevant document statistics for each domain.

Minor improvement

  • Minor formatting fix in the BEIR_NL benchmark description for improved readability.

@whybe-choi changed the title to "refactor: split BRIGHT benchmark into individual subset tasks" on Oct 7, 2025
@Samoed requested a review from Muennighoff on October 7, 2025 13:40
@whybe-choi force-pushed the bright-subset-tasks branch from 4240bdb to 826990a on October 7, 2025 14:36
@KennethEnevoldsen (Contributor) left a comment (marked as resolved)

Hmm, this change will invalidate all previous results on BRIGHT.

You know that you can also simply subselect from a task using:

task = mteb.get_task("BrightRetrieval", eval_splits=..., hf_subsets=...)
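A slightly fuller sketch of that subselection route (illustrative only; parameter spellings and subset/split names follow the snippet above and recent mteb conventions, and may differ by version):

import mteb

# Evaluate only one subset of the existing BrightRetrieval task
# ("standard"/"long" are the BRIGHT eval splits; "biology" is an example subset).
task = mteb.get_task("BrightRetrieval", eval_splits=["standard"], hf_subsets=["biology"])
model = mteb.get_model("BAAI/bge-large-en-v1.5")
mteb.MTEB(tasks=[task]).run(model, output_folder="results")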

For the leaderboard display it is even possible to create custom summary tables (see e.g. #3272)

@Samoed (Member) commented Oct 7, 2025

You know that you can also simply subselect from a task using:

Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We can add support for configuring prompts for different subsets, but I'm not sure if that's a good idea
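For illustration, per-subset prompt configuration might look something like the following key scheme (purely hypothetical, not current mteb behaviour; mteb currently resolves prompts per task and per query/document, not per subset):

# Hypothetical: model prompts keyed by task, subset and prompt type.
model_prompts = {
    "BrightRetrieval-biology-query": "Given a biology post, retrieve relevant passages that help answer the post",
    "BrightRetrieval-economics-query": "Given a economics post, retrieve relevant passages that help answer the post",
}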

@KennethEnevoldsen (Contributor)

Yes, but BRIGHT requires different prompts for different subsets, and because of that we probably need to split it. We can add support for configuring prompts for different subsets, but I'm not sure if that's a good idea

Ohh... Yeah that is hard to fix.

I see that the original BRIGHT (long) only has four models and BRIGHT only has 12, so I guess it is possible to rerun them.

@Muennighoff (Contributor)

If the scores change, are the new scores more similar to or more different from the official scores? If closer, then I think it is fine, and maybe we can rerun some models. For many of the models on our BRIGHT leaderboard I just converted the scores from https://brightbenchmark.github.io/ to MTEB format when we originally added them, so they may still be fine if these changes actually bring our implementation closer to that one.

@Samoed added the "new dataset" label Oct 7, 2025
@whybe-choi (Contributor, Author)

Would it be enough to evaluate the performance of ReasonIR, or is there a list of other models that should be tested?

@Samoed (Member) commented Oct 8, 2025

To check the implementation, this will be enough; just don't update the old leaderboard.

@whybe-choi force-pushed the bright-subset-tasks branch from 826990a to 3ed620f on October 8, 2025 11:33
@whybe-choi force-pushed the bright-subset-tasks branch from 3ed620f to 57c757f on October 8, 2025 11:53
@whybe-choi (Contributor, Author)

After splitting BrightRetrieval into multiple tasks, I ran ReasonIR on them with task-specific prompts using the following code:

import torch
import mteb

# https://github.com/facebookresearch/ReasonIR/tree/main/evaluation/bright/configs/reasonir
prompts_dict = {
    "BrightBiologyRetrieval": "Given a Biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a Earth Science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a Economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a Psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a Robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a Stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a Sustainable Living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a Pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}

tasks = mteb.get_tasks(tasks=list(prompts_dict.keys()), languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    prompts_dict=prompts_dict,
)

evaluation.run(
    model,
    save_predictions=True,
    output_folder="evaluation/results",
    encode_kwargs={"batch_size": 1},
)

The results are as follows:

|              | Bio.  | Earth. | Econ. | Psy.  | Rob.  | Stack. | Sus.  | Leet. | Pony | AoPS | TheoQ. | TheoT. | Avg.  |
|--------------|-------|--------|-------|-------|-------|--------|-------|-------|------|------|--------|--------|-------|
| before split | 24.31 | 30.83  | 24.27 | 28.95 | 18.40 | 21.68  | 20.57 | 18.14 | 9.49 | 4.84 | 18.21  | 26.42  | 20.51 |
| after split  | 26.18 | 30.71  | 23.96 | 29.76 | 18.62 | 21.15  | 19.89 | 19.65 | 9.22 | 5.12 | 18.34  | 27.12  | 20.81 |

In the paper:
[screenshot of the corresponding ReasonIR results table from the paper]

@Samoed (Member) commented Oct 9, 2025

Great results! But I'm a bit unsure whether the prompts are applied correctly when they're passed through get_model.

@whybe-choi (Contributor, Author)

This is the relevant part of the wrapper's encode path, where the instruction is passed as the prompt:

if instruction:
    logger.info(f"Using instruction: '{instruction}' for task: '{task_name}'")
embeddings = self.model.encode(
    sentences,
    prompt=instruction,
    **kwargs,
)
if isinstance(embeddings, torch.Tensor):
    # sometimes kwargs can contain return_tensors=True
    embeddings = embeddings.cpu().detach().float().numpy()
return embeddings

After adding code to print the formatted instruction, the following output was produced:

# Biology
Retrieval
    - BrightBiologyRetrieval, s2p


instruction: <|user|>
Given a Biology post, retrieve relevant passages that help answer the post<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [00:06<00:00, 15.80it/s]
instruction: <|embed|>

Batches:   0%|                                                                                            | 2/50000 [00:02<18:01:38,  1.30s/it
# Psychology
Retrieval
    - BrightPsychologyRetrieval, s2p


instruction: <|user|>
Given a Psychology post, retrieve relevant passages that help answer the post<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 101/101 [00:07<00:00, 14.12it/s]
instruction: <|embed|>

Batches:   0%|                                                                                                       | 0/50000 [00:01<?, ?it/s]
# Aops
Retrieval
    - BrightAopsRetrieval, s2p


instruction: <|user|>
Given a Math problem, retrieve relevant examples that help answer the problem<|embed|>

Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 111/111 [00:06<00:00, 16.13it/s]
instruction: <|embed|>

Batches:   0%|                                                                                            | 17/50000 [00:09<7:16:33,  1.91it/s]

@Samoed (Member) commented Oct 9, 2025

Interesting, thanks! I didn’t think that would work since it’s a bit unintended, but maybe we should update the code to handle this case.

I've checked the ReasonIR code and found some other places that may help with reproduction:

  1. In some cases, the rewritten query is concatenated with the original query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L82-L87
  2. Sometimes reasoning traces are added to the query: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L124
  3. Maybe IDs can be filtered (ref Excluded IDs missing from BRIGHT dataset #2696), but in the ReasonIR code they just check that no IDs intersect: https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/run.py#L130-L131 (see the sketch after this list)
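For point 3, a minimal sketch of what filtering those excluded IDs could look like on our side (the function and variable names are illustrative, not existing mteb API):

def filter_excluded_ids(
    results: dict[str, dict[str, float]],
    excluded_ids: dict[str, list[str]],
) -> dict[str, dict[str, float]]:
    """Drop excluded corpus ids from retrieved results before scoring.

    Mirrors the ReasonIR check that excluded ids do not intersect the qrels.
    """
    filtered: dict[str, dict[str, float]] = {}
    for query_id, doc_scores in results.items():
        banned = set(excluded_ids.get(query_id, []))
        filtered[query_id] = {
            doc_id: score
            for doc_id, score in doc_scores.items()
            if doc_id not in banned
        }
    return filtered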

@Muennighoff Can you advise on what we can do to reproduce the results?

@Muennighoff (Contributor)

I think the IDs filtering is probably the main missing piece to fully reproduce results?

@whybe-choi (Contributor, Author)

I think points 1 and 2 are a separate issue, as they are related to query expansion. The problem of the performance not being reproducible with the plain ReasonIR model seems to be related to the issue mentioned in point 3.

Samoed and others added 4 commits October 20, 2025 21:56
# Conflicts:
#	mteb/benchmarks/benchmarks/__init__.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/retrieval/eng/BrightSubsetsLongRetrieval.py
#	mteb/tasks/retrieval/eng/BrightSubsetsRetrieval.py
@whybe-choi (Contributor, Author)

@Samoed

I think it would be better to close this PR and work on it later together with Excluded IDs missing from BRIGHT dataset #2696. Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?

@Samoed (Member) commented Oct 22, 2025

I think it would be better to close this PR and work on it later together

Do you mean that you don't want the tasks in this PR and will add another PR for #2696?

Also, it should be revised to fit the v2 format and include descriptive stats as well. What do you think?

Yes, you need to add the statistics to merge. To apply the v2 format, you can select subsets from https://huggingface.co/datasets/mteb/BrightRetrieval, but the retrieval dataset loader requires the dataset to have strictly corpus, qrels and queries, so maybe we need to reupload them instead.
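For concreteness, such a "strict" layout would let each subset be loaded roughly like this (hypothetical repo and config names for a reuploaded per-subset dataset; the current mteb/BrightRetrieval uses different configs):

import datasets

# Hypothetical reuploaded dataset with the three configs the v2 retrieval
# loader expects (corpus / queries / qrels); repo and split names are illustrative.
repo = "mteb/BrightBiologyRetrieval"
corpus = datasets.load_dataset(repo, "corpus", split="standard")
queries = datasets.load_dataset(repo, "queries", split="standard")
qrels = datasets.load_dataset(repo, "qrels", split="standard")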

@whybe-choi (Contributor, Author)

What tasks need to be redone for this PR? I'm confused about the changes with the v2 format, so I would appreciate your help.

@Samoed (Member) commented Oct 22, 2025

I think we can solve #2696 in this PR, because otherwise we would need to create v2 versions of these tasks, which I don't think is a good solution.

Comment on lines +22 to +38
domain_corpus_long = datasets.load_dataset(
    path,
    "long_documents",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
examples = datasets.load_dataset(
    path,
    "examples",
    split=domain,
    cache_dir=cache_dir,
    revision=revision,
)
corpus["long"] = {e["id"]: {"text": e["content"]} for e in domain_corpus_long}
queries["long"] = {e["id"]: e["query"] for e in examples}
relevant_docs["long"] = defaultdict(dict)
Review comment (Member):

To follow the v2 format, you can remove the conversion of the dataset to a dict and pass the dataset directly.

domain_corpus_long = domain_corpus_long.rename_column("content", "text")
queries = queries.rename_column("query", "text")
...
return domain_corpus_long, queries, relevant_docs

if self.data_loaded:
    return

self.corpus, self.queries, self.relevant_docs = load_bright_long_data(
Review comment (Member):

And then here it should look like

self.dataset["default"]["long"]["corpus"], self.dataset["default"]["long"]["queries"], self.dataset["default"]["long"]["relevant_documents"]

You can refer to:

class RetrievalSplitData(TypedDict):
    """A dictionary containing the corpus, queries, relevant documents, instructions, and top-ranked documents for a retrieval task.

    Attributes:
        corpus: The corpus dataset containing documents. Should have columns `id`, `title`, `text` or `image`.
        queries: The queries dataset containing queries. Should have columns `id`, `text`, `instruction` (for instruction retrieval/reranking) or `image`.
        relevant_docs: A mapping of query IDs to relevant document IDs and their relevance scores. Should have columns `query-id`, `corpus-id`, `score`.
        top_ranked: A mapping of query IDs to a list of top-ranked document IDs. Should have columns `query-id`, `corpus-ids` (list[str]). This is optional and used for reranking tasks.
    """

    corpus: CorpusDatasetType
    queries: QueryDatasetType
    relevant_docs: RelevantDocumentsType
    top_ranked: TopRankedDocumentsType | None
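Putting both suggestions together, the end of load_data might look roughly like this (a sketch only, assuming the v2 loader accepts the datasets directly and the RetrievalSplitData keys above; not the exact implementation):

def load_data(self, **kwargs) -> None:
    if self.data_loaded:
        return

    # datasets with renamed columns, per the suggestion above
    corpus, queries, relevant_docs = load_bright_long_data(...)

    self.dataset = {
        "default": {       # single hf_subset
            "long": {      # eval split
                "corpus": corpus,
                "queries": queries,
                "relevant_docs": relevant_docs,
                "top_ranked": None,
            }
        }
    }
    self.data_loaded = True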

@Samoed (Member) commented Oct 24, 2025

Great! So for now the most different task is Pony?

@whybe-choi (Contributor, Author) commented Oct 24, 2025

Among the tasks with excluded_ids, pony seems to be the most different. The other tasks seem to have reproduced the performance reported in the paper to some extent.

@Samoed (Member) commented Oct 24, 2025

| task              | Paper | PR   | Diff |
|-------------------|-------|------|------|
| Aops              | 14.7  | 15.6 | +0.9 |
| Biology           | 26.2  | 26.1 | -0.1 |
| Economics         | 23.3  | 24.0 | +0.7 |
| Pony              | 10.5  | 9.3  | -1.2 |
| Robotics          | 18.0  | 18.6 | +0.6 |
| StackOverflow     | 23.9  | 21.1 | -2.8 |
| TheoremQAQuestion | 31.9  | 30.1 | -1.8 |
| TheoremQATheorem  | 27.2  | 26.5 | -0.7 |

I think the main difference is because you've evaluated the shots version of the datasets, but it's hard to tell how the scores in the paper table were produced. @Muennighoff Can you help with reproducing the scores?

@Muennighoff (Contributor)

Scores are looking really close, great work. Are you asking me whether in the paper they were evaluated with shots or without?

@Samoed (Member) commented Oct 24, 2025

were evaluated with shots or without?

Yes

@Muennighoff (Contributor)

Yeah I think those specific paper results are zero-shot

@whybe-choi (Contributor, Author)

I set max_seq_length to 32768 based on the following reference. Is this correct?
https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/retrievers.py#L725-L726

@Muennighoff (Contributor)

Yeah that seems right to me (cc'ing @RulinShao in case she has thoughts on if we're missing sth for full reproduction or scores seem close enough)

@whybe-choi (Contributor, Author)

I made a mistake by omitting a newline (\n) before <|embed|> in the query instruction. After correcting this and re-evaluating the performance, the score for the biology task was 26.87 (previously it was 26.1). Since this change is likely to affect the performance of other tasks as well, I will rerun the experiments and attach the updated results accordingly.

Also, I would like to ask if it would be better to handle this modification as a new PR.

def instruction_template(
    instruction: str, prompt_type: PromptType | None = None
) -> str:
    return (
        # https://github.com/facebookresearch/ReasonIR/blob/0aac96269e455965949df16520fab72da68ffc22/evaluation/bright/configs/reasonir/economics.json#L3
        f"<|user|>\n{instruction}<|embed|>\n"
        if (prompt_type is None or prompt_type == PromptType.query) and instruction
        else "<|embed|>\n"
    )

# https://github.com/facebookresearch/ReasonIR/blob/main/evaluation/bright/configs/reasonir/biology.json
{
  "instructions": {
    "query": "<|user|>\nGiven a {task} post, retrieve relevant passages that help answer the post\n<|embed|>\n",
    "document": "<|embed|>\n"
  },
  "instructions_long": {
    "query": "<|user|>\nGiven a {task} post, retrieve relevant documents that help answer the post\n<|embed|>\n",
    "document": "<|embed|>\n"
  }
}
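For reference, a corrected template matching the config above would presumably look like this (a sketch mirroring the function shown earlier; the actual fix is left to the separate PR discussed below):

def instruction_template(
    instruction: str, prompt_type: PromptType | None = None
) -> str:
    # Adds the missing "\n" before <|embed|>, matching the ReasonIR configs.
    return (
        f"<|user|>\n{instruction}\n<|embed|>\n"
        if (prompt_type is None or prompt_type == PromptType.query) and instruction
        else "<|embed|>\n"
    )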

@Samoed (Member) commented Oct 28, 2025

I think it would be better to make the fix in a separate PR.

@whybe-choi (Contributor, Author)

When I look at the original repository, it seems like the TASK_MAP variable exists but is not being used. Because of this, the task names included in the instruction are slightly different (e.g., Biology -> biology, Sustainable Living -> sustainable_living). I re-evaluated the performance to reflect this change.

The performance was measured based on the following code:

import torch
import mteb
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval": "Given a biology post, retrieve relevant passages that help answer the post",
    "BrightEarthScienceRetrieval": "Given a earth_science post, retrieve relevant passages that help answer the post",
    "BrightEconomicsRetrieval": "Given a economics post, retrieve relevant passages that help answer the post",
    "BrightPsychologyRetrieval": "Given a psychology post, retrieve relevant passages that help answer the post",
    "BrightRoboticsRetrieval": "Given a robotics post, retrieve relevant passages that help answer the post",
    "BrightStackoverflowRetrieval": "Given a stackoverflow post, retrieve relevant passages that help answer the post",
    "BrightSustainableLivingRetrieval": "Given a sustainable_living post, retrieve relevant passages that help answer the post",
    "BrightPonyRetrieval": "Given a pony question, retrieve relevant passages that help answer the question",
    "BrightLeetcodeRetrieval": "Given a coding problem, retrieve relevant examples that help answer the problem",
    "BrightAopsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
    "BrightTheoremQATheoremsRetrieval": "Given a Math problem, retrieve relevant theorems that help answer the problem",
    "BrightTheoremQAQuestionsRetrieval": "Given a Math problem, retrieve relevant examples that help answer the problem",
}


model_path = "ReasonIR/ReasonIR-8B"
model_name = model_path.split("/")[-1]

model = mteb.get_model(
    "ReasonIR/ReasonIR-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    max_seq_length=32768,
    prompts_dict=prompts_dict,
)
cache_dir = "evaluation/cache/bright"
for task_name in prompts_dict.keys():
    print(f"task: {task_name}")
    tasks = mteb.get_tasks(tasks=[task_name], languages=["eng"])
    cache = mteb.cache.ResultCache(cache_dir)

    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder=f"{cache_dir}/predictions/{model_name.replace('/', '__')}",
            encode_kwargs={"batch_size": 1},
        )
        print(f"{task_name} completed successfully")
        torch.cuda.empty_cache()

    except torch.cuda.OutOfMemoryError:
        print(f"{task_name} skipped due to OOM error")
        torch.cuda.empty_cache()
        continue

The performance differences are as follows:

| task              | Paper | PR   | Diff |
|-------------------|-------|------|------|
| Aops              | 14.7  | 14.7 | 0    |
| Biology           | 26.2  | 26.3 | +0.1 |
| Economics         | 23.3  | 23.8 | +0.5 |
| Pony              | 10.5  | 10.0 | -0.5 |
| Robotics          | 18.0  | 18.1 | +0.1 |
| StackOverflow     | 23.9  | 20.6 | -3.3 |
| TheoremQAQuestion | 31.9  | 29.8 | -2.1 |
| TheoremQATheorem  | 27.2  | 26.7 | -0.5 |

I'm not sure if the problem is with torch_dtype.

@Samoed (Member) commented Oct 30, 2025

Interestingly, the score on most tasks dropped: AoPS by 0.9, StackOverflow by 0.5.

@whybe-choi You can try to reproduce the scores for bge-large-en-v1.5, all-mpnet-base-v2, or bm25 from the BRIGHT paper https://arxiv.org/pdf/2407.12883, but I'm not sure whether they reported the short or long versions there either.

UPD: In the BRIGHT paper I think the main table reports the short version, because the long version is in Table 39.

@whybe-choi (Contributor, Author) commented Oct 30, 2025

To reproduce results for bge-large-en-v1.5, I conducted the evaluation based on the following code:

import torch
import mteb
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

prompts_dict = {
    "BrightBiologyRetrieval-query": "Represent this biology post for searching relevant passages: ",
    "BrightEarthScienceRetrieval-query": "Represent this earth_science post for searching relevant passages: ",
    "BrightEconomicsRetrieval-query": "Represent this economics post for searching relevant passages: ",
    "BrightPsychologyRetrieval-query": "Represent this psychology post for searching relevant passages: ",
    "BrightRoboticsRetrieval-query": "Represent this robotics post for searching relevant passages: ",
    "BrightStackoverflowRetrieval-query": "Represent this stackoverflow post for searching relevant passages: ",
    "BrightSustainableLivingRetrieval-query": "Represent this sustainable_living post for searching relevant passages: ",
    "BrightPonyRetrieval-query": "Represent this Pony question for searching relevant passages: ",
    "BrightLeetcodeRetrieval-query": "Represent this Coding problem for searching relevant examples: ",
    "BrightAopsRetrieval-query": "Represent this Math problem for searching relevant examples: ",
    "BrightTheoremQATheoremsRetrieval-query": "Represent this Math problem for searching relevant theorems: ",
    "BrightTheoremQAQuestionsRetrieval-query": "Represent this Math problem for searching relevant examples: ",
}


model_path = 'BAAI/bge-large-en-v1.5'
model_name = model_path.split("/")[-1]

model = mteb.get_model(
    model_path,
    model_kwargs={"torch_dtype": torch.float32},
    tokenizer_kwargs={"max_seq_length": 512},
    model_prompts=prompts_dict,
)
cache_dir = "evaluation/cache/bright_v2"
for task_name in prompts_dict.keys():
    task_name = task_name.split("-")[0]
    print(f"task: {task_name}")
    tasks = mteb.get_tasks(tasks=[task_name], languages=["eng"])
    cache = mteb.cache.ResultCache(cache_dir)

    try:
        mteb.evaluate(
            model,
            tasks,
            cache=cache,
            overwrite_strategy="only-missing",
            prediction_folder=f"{cache_dir}/predictions/{model_name.replace('/', '__')}",
            encode_kwargs={"batch_size": 1},
        )
        print(f"✅ {task_name} completed successfully")
        torch.cuda.empty_cache()

    except torch.cuda.OutOfMemoryError:
        print(f"⚠️ {task_name} skipped due to OOM error")
        torch.cuda.empty_cache()
        continue

The results are as follows:

| task               | Paper | PR   | Diff |
|--------------------|-------|------|------|
| Biology            | 11.7  | 12.0 | +0.3 |
| Earth Science      | 24.6  | 24.2 | -0.4 |
| Economics          | 16.6  | 16.6 | 0    |
| Psychology         | 17.5  | 17.5 | 0    |
| Robotics           | 11.7  | 12.2 | +0.5 |
| Stackoverflow      | 10.8  | 9.5  | -1.3 |
| Sustainable Living | 13.3  | 13.3 | 0    |
| Leetcode           | 26.7  | 26.7 | 0    |
| Pony               | 5.7   | 5.6  | -0.1 |
| AoPS               | 6.0   | 6.1  | +0.1 |
| TheoremQAQuestion  | 13.0  | 12.6 | -0.4 |
| TheoremQATheorem   | 6.9   | 5.5  | -1.4 |

@Samoed (Member) commented Oct 30, 2025

@whybe-choi You didn't add the code

@Samoed (Member) commented Oct 30, 2025

I will try to rerun bge from the BRIGHT repo.

@Samoed (Member) commented Oct 31, 2025

I ran bge on earth, biology and pony and got the same results as in the paper.

@whybe-choi (Contributor, Author)

Did I miss anything when evaluating the bge model using mteb?

@Samoed (Member) commented Oct 31, 2025

I don't know for now; I will try to dig deeper over the weekend.

@Samoed (Member) commented Nov 5, 2025

I tried to debug it, but I'm getting slightly different embeddings for the same texts. For example, with mteb I'm getting:

Represent this Pony question for searching relevant passages: I will use the programming language pony.
Problem:
You are given two strings word1 and word2. Merge the strings by adding letters in alternating order, starting with word1. If a string is longer than the other, append the additional letters onto the end of the merged string. Write a funtion that returns the merged string.

Here is the code template:
fun mergeAlternately(word1: String, word2: String): String =>
...

[ 0.03160078  0.00635942  0.01886853 ... -0.01480357 -0.00048236
 -0.02248559]

But with the BRIGHT repo I get:

Represent this Pony question for searching relevant passages: I will use the programming language pony.
Problem:
You are given two strings word1 and word2. Merge the strings by adding letters in alternating order, starting with word1. If a string is longer than the other, append the additional letters onto the end of the merged string. Write a funtion that returns the merged string.

Here is the code template:
fun mergeAlternately(word1: String, word2: String): String =>
...

[ 0.03159456  0.00634383  0.01886596 ... -0.01478089 -0.00048517
 -0.02248857]

I even added normalize_embeddings=True to encode and used the same dependencies as in the BRIGHT repo, but I'm still getting this difference. https://github.com/xlang-ai/BRIGHT/blob/d99e8391d967d4c2b3a74732530d2309e2fc92b6/retrievers.py#L243

This leads to a difference in the resulting similarities, e.g. for ["0"]["Pony/src-math-is_prime-_0.txt"]:

0.7551534175872803 # BRIGHT
0.7552036643028259 # mteb

So I think the scores are as close as possible and good enough to integrate. I don't know how to reproduce this fully. NDCG@1 for Pony is the same.
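For reference, the gap in the reported embedding components is tiny (a quick check using the first and last values printed above):

import numpy as np

# First and last reported components of the Pony query embedding (mteb vs. BRIGHT repo)
mteb_emb = np.array([0.03160078, 0.00635942, 0.01886853, -0.01480357, -0.00048236, -0.02248559])
bright_emb = np.array([0.03159456, 0.00634383, 0.01886596, -0.01478089, -0.00048517, -0.02248857])
print(np.max(np.abs(mteb_emb - bright_emb)))  # ~2e-05, i.e. ordinary numeric drift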

@whybe-choi (Contributor, Author)

I'm sorry to hear that. Is there any additional work I need to do on this PR?

@Samoed (Member) commented Nov 5, 2025

I think you can add prompts to the tasks and write better descriptions for them.

@whybe-choi (Contributor, Author)

I think it's tricky to add a prompt because the format of the prompt varies for each model. For example, each model uses the following prompt for BrightBiologyRetrieval:

  • bge-large-en-v1.5: Represent this biology post for searching relevant passages:
  • ReasonIR: Given a biology post, retrieve relevant passages that help answer the post

@Samoed (Member) commented Nov 5, 2025

I think you can add a prompt like the one for bge.

Successfully merging this pull request may close these issues: Results on BRIGHT not matching